fix(amber): add timeout and retry to dataset file-service requests#5667
fix(amber): add timeout and retry to dataset file-service requests#5667Ma77Ball wants to merge 7 commits into
Conversation
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 480 | 0.293 | 20,137/28,757/28,757 us | 🔴 -11.8% / 🟢 -17.8% |
| ⚪ | bs=100 sw=10 sl=64 | 1,226 | 0.748 | 80,682/96,233/96,233 us | ⚪ within ±5% / 🟢 +37.5% |
| ⚪ | bs=1000 sw=10 sl=64 | 1,431 | 0.873 | 701,624/730,806/730,806 us | ⚪ within ±5% / 🟢 +37.5% |
Baseline details
Latest main dfa0434 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 480 tuples/sec | 544 tuples/sec | 410.82 tuples/sec | -11.8% | +16.8% |
| bs=10 sw=10 sl=64 | MB/s | 0.293 MB/s | 0.332 MB/s | 0.251 MB/s | -11.7% | +16.9% |
| bs=10 sw=10 sl=64 | p50 | 20,137 us | 18,027 us | 23,785 us | +11.7% | -15.3% |
| bs=10 sw=10 sl=64 | p95 | 28,757 us | 27,273 us | 34,980 us | +5.4% | -17.8% |
| bs=10 sw=10 sl=64 | p99 | 28,757 us | 27,273 us | 34,980 us | +5.4% | -17.8% |
| bs=100 sw=10 sl=64 | throughput | 1,226 tuples/sec | 1,196 tuples/sec | 891.94 tuples/sec | +2.5% | +37.5% |
| bs=100 sw=10 sl=64 | MB/s | 0.748 MB/s | 0.73 MB/s | 0.544 MB/s | +2.5% | +37.4% |
| bs=100 sw=10 sl=64 | p50 | 80,682 us | 83,073 us | 112,277 us | -2.9% | -28.1% |
| bs=100 sw=10 sl=64 | p95 | 96,233 us | 100,618 us | 139,802 us | -4.4% | -31.2% |
| bs=100 sw=10 sl=64 | p99 | 96,233 us | 100,618 us | 139,802 us | -4.4% | -31.2% |
| bs=1000 sw=10 sl=64 | throughput | 1,431 tuples/sec | 1,460 tuples/sec | 1,041 tuples/sec | -2.0% | +37.5% |
| bs=1000 sw=10 sl=64 | MB/s | 0.873 MB/s | 0.891 MB/s | 0.635 MB/s | -2.0% | +37.4% |
| bs=1000 sw=10 sl=64 | p50 | 701,624 us | 687,812 us | 972,714 us | +2.0% | -27.9% |
| bs=1000 sw=10 sl=64 | p95 | 730,806 us | 711,321 us | 1,023,057 us | +2.7% | -28.6% |
| bs=1000 sw=10 sl=64 | p99 | 730,806 us | 711,321 us | 1,023,057 us | +2.7% | -28.6% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,416.57,200,128000,480,0.293,20137.10,28756.87,28756.87
1,100,10,64,20,1631.13,2000,1280000,1226,0.748,80682.38,96232.83,96232.83
2,1000,10,64,20,13977.21,20000,12800000,1431,0.873,701624.06,730806.35,730806.35
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5667 +/- ##
============================================
+ Coverage 52.91% 52.95% +0.03%
+ Complexity 2626 2481 -145
============================================
Files 1090 1075 -15
Lines 42188 41678 -510
Branches 4531 4503 -28
============================================
- Hits 22324 22070 -254
+ Misses 18555 18315 -240
+ Partials 1309 1293 -16
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
file-service requests DatasetFileDocument made two requests.get() calls (presigned-URL fetch and file download) with no timeout, so a hung or unreachable file-service would block the worker thread indefinitely. Route both calls through a Session with a (10s connect, 60s read) timeout and a urllib3 Retry policy (3 retries, exponential backoff, retrying on connection errors and 5xx). Both calls are idempotent GETs, so retrying is safe.
977e455 to
206ac02
Compare
…-document-request-timeout
|
@Yicong-Huang would you be able to review this when you get a chance? It hardens |
|
Sure. Adding @bobbai00 as well PTAL |
Yicong-Huang
left a comment
There was a problem hiding this comment.
LGTM in general, please see inline comments
| import requests | ||
| import urllib.parse | ||
| from requests.adapters import HTTPAdapter | ||
| from urllib3.util.retry import Retry |
There was a problem hiding this comment.
I checked we did not require urllib3 as a dependency, it was available due to a transitive dependency from requests.
we should not import a transitive dependency directly: if we import something then it is our dependency.
I think there are two ways to go:
- we can add
urllib3to dependency, officially. This actually might be a good idea for the long run. - the retry logic can be quite simple to implement with vanilla python try catch, we could do it without
urllib3.
I would recommend 1 in this case.
| _RETRY_STATUS_FORCELIST = (500, 502, 503, 504) | ||
|
|
||
| @classmethod | ||
| def _build_session(cls) -> requests.Session: |
There was a problem hiding this comment.
if you are using this session only for retry, as you mentioned in the doc string, it might be better to rename it _retry_session()
and use with
with _retry_session() as session:
// your logic
| _CONNECT_TIMEOUT_SECONDS = 10 | ||
| _READ_TIMEOUT_SECONDS = 60 |
There was a problem hiding this comment.
I feel generally we can use a shorter timeout. if retry 3 times with 60s each that's already 3 minutes. let's make it a few seconds or shorter?
What changes were proposed in this PR?
DatasetFileDocument's presigned-URL fetch and file download through arequests.Sessionconfigured with a(10s connect, 60s read)timeout, so a hung or unreachable file-service fails in bounded time instead of blocking the worker thread forever.urllib3Retrypolicy on the session (3 retries, exponential backoff, retrying on connection errors and 5xx). Both calls are idempotent GETs, so the retry set is restricted toGET.RuntimeError, consistent with the module's existing failure handling, so callers get a uniform error contract instead of a rawrequests/urllib3exception.Any related issues, documentation, discussions?
Closes: #5666
How was this PR tested?
pytestcoverage intest_dataset_file_document.py(26 tests):(connect, read)timeout tuple is passed on both the presigned-URL request and the file download;http://andhttps://with the expected policy (total=3,connect=3,read=3,backoff_factor=0.5,status_forcelist={500,502,503,504}, GET-only);ReadTimeout/ConnectionErroris wrapped inRuntimeErroron both code paths.ruff checkandruff format --checkpass on the modified files.Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF